Preface



Welcome to the proceedings of PERVASIVE 2004, the 2nd International
Conference on Pervasive Computing and the premier forum for the presentation
and appraisal of the most recent and most advanced research results in all
foundational and applied areas of pervasive and ubiquitous computing.
Considering the half-life period of technologies and knowledge this community
is facing, PERVASIVE is one of the most vibrant, dynamic, and evolutionary
among the computer-science-related symposia and conferences.

The research challenges, efforts, and contributions in pervasive computing
have experienced a breathtaking acceleration over the past couple of years,
mostly due to technological progress, growth, and a shift of paradigms in
computer science in general. As for technological advances, a vast manifold of
tiny, embedded, and autonomous computing and communication systems have
started to create and populate a pervasive and ubiquitous computing landscape,
characterized by paradigms like autonomy, context-awareness, spontaneous
interaction, seamless integration, self-organization, ad hoc networking,
invisible services, smart artifacts, and everywhere interfaces. The maturing of
wireless networking, miniaturized information-processing possibilities induced
by novel microprocessor technologies, low-power storage systems, smart
materials, and technologies for motors, controllers, sensors, and actuators
envision a future computing scenario in which almost every object in our
everyday environment will be equipped with embedded processors, wireless
communication facilities, and embedded software to perceive, perform, and
control a multitude of tasks and functions. Since many of these objects are
already able to communicate and interact with global networks and with each
other, the vision of context-aware “smart appliances” and “smart spaces” has
already become a reality. Service provision is based on the ability of being
aware of the presence of other objects or users, and systems can be designed in
order to be sensitive, adaptive, and responsive to their needs, habits, and even
emotions. With pervasive computing technology embodied into real-world objects
like furniture, clothing, crafts, rooms, etc., those artifacts also become the
interface to “invisible” services and allow them to mediate between the
physical and digital (or virtual) world via natural interaction - away from
desktop displays and keyboards. All these observations pose serious challenges
to the conceptual architectures of computing, and the related engineering
disciplines in computer science. PERVASIVE rises to those challenges.

A program committee of 30 leading scientists, together with the help of
external expert reviewers, shaped the PERVASIVE 2004 scientific program, the
incarnation of which you now hold in your hands. Upon the call for papers, 278
submissions were received for consideration in the conference program - 212 for
the paper track (including 8 tech-notes), 49 for the hot spot paper track, and
17 for the video paper track. In the paper track, each submission was assigned
for review to at least three program committee members, who in turn often
involved further experts in the review process, so that each paper received at
least three (on average 3.27, at most 8) independent reviews. After a lively
discussion in the program committee meeting on December 13, 2003, assessing the
scientific quality and merits of each individual submission on top of the
scoring it received from reviewers, 27 papers were accepted for presentation at
PERVASIVE 2004 (12.7% acceptance). One accepted paper had to be withdrawn by the
authors for restricted corporate reasons. Out of the 27 papers 19 were accepted
in the category regular papers and 8 in the category tech-notes. Tech-notes are
not to be understood as short papers condensed into fewer pages, but are
intended to present pointed results at a high level of technicality in a very
focused and compact format.

The PERVASIVE 2004 venue and presentation schedule were to some extent
experimental, but appealing and promising. While an international doctoral
colloquium preceded the main conference on April 18-19 at the University of
Linz, tutorials and workshops opened the PERVASIVE 2004 activities in Vienna on
April 20. The workshop topics expressed a good blend of topical research issues
emerging under the pervasive computing umbrella: Gaming Applications in
Pervasive Computing Environments (W1), Toolkit Support for Interaction in the
Physical World (W2), Memory and Sharing of Experiences (W3), Computer Support
for Human Tasks and Activities (W4), Benchmarks and a Database for Context
Recognition (W5), SPPC: Security and Privacy in Pervasive Computing (W6), and
Sustainable Pervasive Computing (W7). Technical paper sessions were scheduled
from April 21 through April 23, highlighted by two very distinguished keynote
speeches, and an inspiring banquet speech. A special PERVASIVE 2004 Video Night
event presented video contributions in a lively format in a marvelous, historic
place: the festival hall of the University of Vienna. All video clips are
included in the PERVASIVE 2004 Video DVD. All doctoral colloquium papers, hot
spot papers, and video papers are published in the “Advances in Pervasive
Computing” book of the OCG (Vol. 176, ISBN 3-85403-176-9).

We want to thank all the people on the program committee and the volunteer
reviewers (listed on the following pages) with sincere gratitude for their
valuable assistance in this very difficult task of reviewing, judging, and
scoring the technical paper submissions, as well as for their upright and
factual contributions to the final decision process. We particularly wish to
thank Albrecht Schmidt (Ludwig-Maximilians-Universität München) for being a
very pragmatic workshop chair; Gabriele Kotsis (Johannes Kepler University
Linz) for chairing the doctoral colloquium and for her pioneering work in
making the colloquium ECTS credible; Horst Hortner from the AEC (Ars
Electronica Center) Future Lab for chairing the video track, as well as his
team for the support in getting the PERVASIVE 2004 Video DVD produced; Rene
Mayrhofer and Simon Vogl (both Johannes Kepler University Linz) for chairing
the tutorials track; and Karin Anna Hummel (University of Vienna) and Rene
Mayrhofer for their excellent work as publicity co-chairs.

From the many people who contributed to make PERVASIVE 2004 happen, our
special thanks go to Gabriele Kotsis, president of the OCG (Oesterreichische
Computergesellschaft), and her team headed by Eugen Mühlvenzl for co-organizing
this event. As in many previous events of this nature, she was the real
“organizational memory” behind everything - PERVASIVE 2004 would not have come
to happen without her help. Warmest thanks go to both the Rektor of the
University of Linz, Rudolf Ardelt, and the Rektor of the University of Vienna,
Georg Winckler, for hosting PERVASIVE 2004. For their invaluable support making
PERVASIVE 2004 a first-rank international event we thank Reinhard Gobi
(Austrian Ministry of Transport, Innovation and Technology), Erich Prem
(Austria’s FIT-IT Embedded Systems Program), and Gunter Haring (University of
Vienna). Jorgen Bang Jensen, CEO of Austria’s mobile communications provider
ONE, Florian Pollack (head of ONE Mobile Living), and Florian Stieger (head of
ONE Smart Space) generously helped in facilitating PERVASIVE 2004 and hosted
the program committee meeting in ONE’s smart space. Finally, we are grateful
for the cooperative interaction with the organizers of the UbiComp conference
series and their helpful support in finding the right time slot for this and
future PERVASIVE conferences - PERVASIVE is planned to happen annually in
spring, UbiComp in fall. Particular thanks go to Gregory Abowd (Georgia
Institute of Technology), Hans-Werner Gellersen (Lancaster University),
Albrecht Schmidt (Ludwig-Maximilians-Universität München), Lars Erik Holmquist
(Viktoria Institute), Tom Rodden (Nottingham University), Anind Dey (Intel
Research Berkeley), and Joe McCarthy (Intel Research Seattle) for their
mentoring efforts - we look forward to a lively and sisterly interaction with
UbiComp.

Finally, this booklet would not be in your hands without the hard work and 
selfless contributions of Rene Mayrhofer, our technical editor, and the patience 
and professional support of Alfred Hofmann and his team at Springer-Verlag. 
Last but not least we would like to express our sincere appreciation to the orga- 
nizing committee at the Institute for Pervasive Computing at the University of 
Linz, in particular Monika Scholl, Sandra Derntl, and Karin Haudum, as well as 
Rene Mayrhofer, Simon Vogl, Dominik Hochreiter, Volker Christian, Wolfgang 
Narzt, Hans-Peter Baumgartner, Clemens Holzmann, Stefan Oppl, Manfred He- 
chinger, Gunter Blaschek, and Thomas Scheidl. 

The numerous authors who submitted papers, expressing their interest in 
PERVASIVE as the outlet for their research work, deserve our deepest thanks. 
It is their work - very often conducted in selfless and expendable efforts - that 
gives PERVASIVE its special vitality. We wish to strongly encourage the authors 
not presenting this year to continue their endeavors, and the participants new 
to PERVASIVE to remain part of it by submitting next year. We all hope that 
this year’s program met with your approval, and we encourage you to actively 
contribute to (and thus steer) future PERVASIVE events. 

Activity Recognition from User-Annotated
Acceleration Data



Ling Bao and Stephen S. Intille 

Massachusetts Institute of Technology 
1 Cambridge Center, 4FL 
Cambridge, MA 02142 USA 
intille@mit.edu



Abstract. In this work, algorithms are developed and evaluated to detect
physical activities from data acquired using five small biaxial accelerometers
worn simultaneously on different parts of the body. Acceleration data was
collected from 20 subjects without researcher supervision or observation.
Subjects were asked to perform a sequence of everyday tasks but not told
specifically where or how to do them. Mean, energy, frequency-domain entropy,
and correlation of acceleration data were calculated and several classifiers
using these features were tested. Decision tree classifiers showed the best
performance recognizing everyday activities with an overall accuracy rate of
84%. The results show that although some activities are recognized well with
subject-independent training data, others appear to require subject-specific
training data. The results suggest that multiple accelerometers aid in
recognition because conjunctions in acceleration feature values can effectively
discriminate many activities. With just two biaxial accelerometers - thigh and
wrist - the recognition performance dropped only slightly. This is the first
work to investigate performance of recognition algorithms with multiple,
wire-free accelerometers on 20 activities using datasets annotated by the
subjects themselves.



1 Introduction 

One of the key difficulties in creating useful and robust ubiquitous,
context-aware computer applications is developing the algorithms that can
detect context from noisy and often ambiguous sensor data. One facet of the
user’s context is his physical activity. Although prior work discusses physical
activity recognition using acceleration (e.g. [17,5,23]) or a fusion of
acceleration and other data modalities (e.g. [18]), it is unclear how most
prior systems will perform under real-world conditions. Most of these works
compute recognition results with data collected from subjects under
artificially constrained laboratory settings. Some also evaluate recognition
performance on data collected in natural, out-of-lab settings but only use
limited data sets collected from one individual (e.g. [22]). A number of works
use naturalistic data but do not quantify recognition accuracy. Lastly,
research using naturalistic data collected from multiple subjects has focused
on recognition of a limited subset of nine or fewer everyday activities
consisting largely of ambulatory motions and basic postures such as sitting and
standing (e.g. [10,5]). It is uncertain how prior systems will perform in
recognizing a variety of everyday activities for a diverse sample population
under real-world conditions.

A. Ferscha and F. Mattern (Eds.): PERVASIVE 2004, LNCS 3001, pp. 1-17, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

In this work, the performance of activity recognition algorithms under condi- 
tions akin to those found in real-world settings is assessed. Activity recognition 
results are based on acceleration data collected from five biaxial accelerometers 
placed on 20 subjects under laboratory and semi-naturalistic conditions. Super- 
vised learning classifiers are trained on labeled data that is acquired without 
researcher supervision from subjects themselves. Algorithms trained using only 
user-labeled data might dramatically increase the amount of training data that 
can be collected and permit users to train algorithms to recognize their own 
individual behaviors. 



2 Background 

Researchers have already prototyped wearable computer systems that use ac- 
celeration, audio, video, and other sensors to recognize user activity (e.g. [7]). 
Advances in miniaturization will permit accelerometers to be embedded within 
wrist bands, bracelets, adhesive patches, and belts and to wirelessly send data to 
a mobile computing device that can use the signals to recognize user activities. 

For these applications, it is important to train and test activity recognition 
systems on data collected under naturalistic circumstances, because laboratory 
environments may artificially constrict, simplify, or influence subject activity 
patterns. For instance, laboratory acceleration data of walking displays distinct 
phases of a consistent gait cycle which can aid recognition of pace and incline
[2]. However, acceleration data from the same subject outside of the laboratory 
may display marked fluctuation in the relation of gait phases and total gait 
length due to decreased self-awareness and fluctuations in traffic. Consequently, 
a highly accurate activity recognition algorithm trained on data where subjects 
are told exactly where or how to walk (or where the subjects are the researchers 
themselves) may rely too heavily on distinct phases and periodicity of accelerom- 
eter signals found only in the lab. The accuracy of such a system may suffer when 
tested on naturalistic data, where there is greater variation in gait pattern. 

Many past works have demonstrated 85% to 95% recognition rates for
ambulation, posture, and other activities using acceleration data. Some are
summarized in Figure 1 (see [3] for a summary of other work). Activity
recognition has been performed on acceleration data collected from the hip
(e.g. [17,19]) and from multiple locations on the body (e.g. [5,14]). Related
work using activity counts and computer vision also supports the potential for
activity recognition using acceleration. The energy of a subject’s acceleration
can discriminate sedentary activities such as sitting or sleeping from moderate
intensity activities such as walking or typing and vigorous activities such as
running [25]. Recent work with 30 wired accelerometers spread across the body
suggests that the addition of sensors will generally improve recognition
performance [24].

Ref.  Recognition       Activities Recognized           No.    Data  No.      Sensor
      Accuracy                                          Subj.  Type  Sensors  Placement

[17]  92.85% to 95.91%  ambulation                      8      L     2        2 thigh
[19]  83% to 90%        ambulation, posture             6      L     6        3 left hip, 3 right hip
[10]  95.8%             ambulation, posture, typing,    24     L     4        chest, thigh, wrist, forearm
                        talking, bicycling
[10]  66.7%             ambulation, posture, typing,    24     N     4        chest, thigh, wrist, forearm
                        talking, bicycling
[1]   89.30%            ambulation, posture             5      L     2        chest, thigh
[12]  N/A               walking speed, incline          20     L     4        3 lower back, 1 ankle
[22]  86% to 93%        ambulation, posture, play       1      N     3        2 waist, 1 thigh
[14]  ≈65% to ≈95%      ambulation, typing, stairs,     1      L     up to 36 all major joints
                        shake hands, write on board
[6]   96.67%            3 Kung Fu arm movements         1      L     2        2 wrist
[23]  42% to 96%        ambulation, posture, bicycling  1      L     2        2 lower back
[20]  85% to 90%        ambulation, posture             10     L     2        2 knee

Fig. 1. Summary of a representative sample of past work on activity recognition
using acceleration. The “No. Subj.” column specifies the number of subjects who
participated in each study, and the “Data Type” column specifies whether data
was collected under laboratory (L) or naturalistic (N) settings. The “No.
Sensors” column specifies the number of uniaxial accelerometers used per
subject.

Although the literature supports the use of acceleration for physical activ- 
ity recognition, little work has been done to validate the idea under real-world 
circumstances. Most prior work on activity recognition using acceleration relies 
on data collected in controlled laboratory settings. Typically, the researcher col- 
lected data from a very small number of subjects, and often the subjects have 
included the researchers themselves. The researchers then hand-annotated the 
collected data. Ideally, data would be collected in less controlled settings with- 
out researcher supervision. Further, to increase the volume of data collected, 
subjects would be capable of annotating their own data sets. Algorithms that 
could be trained using only user-labeled data might dramatically increase the 
amount of training data that can be collected and permit users to train algo- 
rithms to recognize their own individual behaviors. In this work we assume that 
labeled training data is required for many automatic activity recognition tasks. 
We note, however, that one recent study has shown that unsupervised learning 
can be used to cluster accelerometer data into categories that, in some instances,
map onto meaningful labels [15]. 

The vast majority of prior work focuses on recognizing a special subset of 
physical activities such as ambulation, with the exception of [10] which examines 
nine everyday activities. Interestingly, [10] demonstrated 95.8% recognition rates 
for data collected in the laboratory but recognition rates dropped to 66.7% 
for data collected outside the laboratory in naturalistic settings. These results 
demonstrate that the performance of algorithms tested only on laboratory data 
or data acquired from the experimenters themselves may suffer when tested on 
data collected under less-controlled (i.e. naturalistic) circumstances. 

Prior literature demonstrates that forms of locomotion such as walking,
running, and climbing stairs and postures such as sitting, standing, and lying
down can be recognized at 83% to 95% accuracy rates using hip, thigh, and ankle
acceleration (see Figure 1). Acceleration data of the wrist and arm are known to
improve recognition rates of upper body activities [6,10] such as typing and
martial arts movements. All past works with multiple accelerometers have used
accelerometers connected with wires, which may restrict subject movement. Based
on these results, this work uses data collected from five wire-free biaxial
accelerometers placed on each subject’s right hip, dominant wrist, non-dominant
upper arm, dominant ankle, and non-dominant thigh to recognize ambulation,
posture, and other everyday activities. Although each of the above five
locations has been used for sensor placement in past work, no work addresses
which of the accelerometer locations provide the best data for recognizing
activities, even though it has been suggested that, for some activities, more
sensors improve recognition [24]. Prior work has typically been conducted with
only 1-2 accelerometers worn at different locations on the body, with only a
few using more than 5 (e.g. [19,14,24]).

3 Design 

Subjects wore 5 biaxial accelerometers as they performed a variety of activities 
under two different data collection protocols. 



3.1 Accelerometers 

Subject acceleration was collected using ADXL210E accelerometers from Analog 
Devices. These two-axis accelerometers are accurate to ±10 G with tolerances
within 2%. Accelerometers were mounted to hoarder boards [11], which sampled 
at 76.25 Hz (with minor variations based on onboard clock accuracy) and stored 
acceleration data on compact flash memory. This sampling frequency is more 
than sufficient compared to the 20 Hz frequency required to assess daily physical 
activity [4]. The hoarder board time stamped one out of every 100 acceleration 
samples, or one every 1.31 seconds. Four AAA batteries can power the hoarder 
board for roughly 24 hours. This is more than sufficient for the 90 minute data 
collection sessions used in this study. A hoarder board is shown in Figure 2a. 

Fig. 2. (a) Hoarder data collection board, which stored data from a biaxial
accelerometer. The biaxial accelerometers are attached to the opposite side of
the board. (b) Hoarder boards were attached to 20 subjects on the 4 limb
positions shown here (held on with medical gauze), plus the right hip. (c)
Acceleration signals from five biaxial accelerometers for walking, running, and
tooth brushing.



Previous work shows promising activity recognition results from ±2 G accel- 
eration data (e.g. [9,14]) even though typical body acceleration amplitude can 
range up to 12 G [4]. However, due to limitations in availability of ±12 G ac- 
celerometers, ±10 G acceleration data was used. Moreover, although body limbs 
and extremities can exhibit a 12 G range in acceleration, points near the torso 
and hip experience a 6 G range in acceleration [4]. 

The hoarder boards were not electronically synchronized to each other and 
relied on independent quartz clocks to time stamp data. Electronic synchroniza- 
tion would have required wiring between the boards which, even when the wiring 
is carefully designed as in [14], would restrict subject movements, especially dur- 
ing whole body activities such as bicycling or running. Further, we have found 
subjects wearing wiring feel self-conscious when outside of the laboratory and 
therefore restrict their behavior. 

To achieve synchronization without wires, hoarder board clocks were syn- 
chronized with subjects’ watch times at the beginning of each data collection 
session. Due to clock skew, hoarder clocks and the watch clock drifted between 
1 and 3 seconds every 24 hours. To minimize the effects of clock skew, hoarder 
boards were shaken together in a fixed sinusoidal pattern in two axes of accel- 
eration at the beginning and end of each data collection session. Watch times 
were manually recorded for the periods of shaking. The peaks of the distinct 
sinusoidal patterns at the beginning and end of each acceleration signal were 
visually aligned between the hoarder boards. Time stamps during the shaking
period were also shifted to be consistent with the recorded watch times for 
shaking. Acceleration time stamps were linearly scaled between these manually 
aligned start and end points. 
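
The linear scaling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rescale_timestamps`, its parameters, and the example anchor values are all hypothetical, and it assumes that the sample indices of the aligned shaking peaks and their recorded watch times are already known.

```python
import numpy as np

def rescale_timestamps(n_samples, start_idx, end_idx, start_time, end_time):
    """Assign a time (in seconds) to every sample of one board's recording.

    start_idx / end_idx: sample indices of the manually aligned shaking
    peaks at the beginning and end of the session (hypothetical inputs).
    start_time / end_time: watch times recorded for those two peaks.
    Times are spread linearly between the two anchors, which absorbs a
    constant clock drift over the session.
    """
    if end_idx <= start_idx:
        raise ValueError("end_idx must be greater than start_idx")
    seconds_per_sample = (end_time - start_time) / (end_idx - start_idx)
    idx = np.arange(n_samples)
    return start_time + (idx - start_idx) * seconds_per_sample

# Toy example: 5 samples, anchors at samples 1 and 3, watch times 10 s and 12 s.
times = rescale_timestamps(5, start_idx=1, end_idx=3, start_time=10.0, end_time=12.0)
print(times)
```

Samples outside the two anchors are extrapolated with the same slope, which is a design choice of this sketch rather than something the text specifies.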

To characterize the accuracy of the synchronization process, three hoarder 
boards were synchronized with each other and a digital watch using the above 
protocol. The boards were then shaken together several times during a full day 
to produce matching sinusoidal patterns on all boards. Visually comparing the 
peaks of these matching sinusoids across the three boards showed mean skew 
of 4.3 samples with a standard deviation of 1.8 samples between the boards. 
At a sampling frequency of 76.25 Hz, the skew between boards is equivalent to 
.0564 ± .0236 s. 
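
The conversion from samples to seconds used above is a simple division by the sampling frequency. The short check below reproduces the reported figures; the constant and function names are illustrative only.

```python
SAMPLING_HZ = 76.25  # hoarder board sampling frequency given in the text

def samples_to_seconds(n_samples, sampling_hz=SAMPLING_HZ):
    """Convert a count of samples into a duration in seconds."""
    return n_samples / sampling_hz

mean_skew_s = samples_to_seconds(4.3)   # mean skew between boards
std_skew_s = samples_to_seconds(1.8)    # standard deviation of the skew
stamp_period_s = samples_to_seconds(100)  # one time stamp per 100 samples

print(f"skew: {mean_skew_s:.4f} +/- {std_skew_s:.4f} s")
print(f"time stamp period: {stamp_period_s:.2f} s")
```

The same division also recovers the interval of roughly 1.31 seconds between time stamps stated in Section 3.1.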

A T-Mobile Sidekick phone pouch was used as a carrying case for each 
hoarder board. The carrying case was light, durable, and provided protection 
for the electronics. A carrying case was secured to the subject’s belt on the right 
hip. All subjects were asked to wear clothing with a belt. Elastic medical ban- 
dages were used to wrap and secure carrying cases at sites other than the hip. 
Typical placement of hoarder boards is shown in Figure 2b. Figure 2c shows 
acceleration data collected for walking, running, and tooth brushing from the 
five accelerometers. 

No wires were used to connect the hoarder boards to each other or any other 
devices. Each hoarder in its carrying case weighed less than 120 g. Subjects 
could engage in vigorous, complex activity without any restriction on movement 
or fear of damaging the electronics. The sensors were still visually noticeable. 
Subjects who could not wear the devices under bulky clothing did report feeling
self-conscious in public spaces.



3.2 Activity Labels 

Twenty activities were studied. These activities are listed in Figure 5. The 20 
activities were selected to include a range of common everyday household ac- 
tivities that involve different parts of the body and range in level of intensity. 
Whole body activities such as walking, predominantly arm-based activities such 
as brushing of teeth, and predominantly leg-based activities such as bicycling 
were included as were sedentary activities such as sitting, light intensity activi- 
ties such as eating, moderate intensity activities such as window scrubbing, and 
vigorous activities such as running. Activity labels were chosen to reflect the 
content of the actions but do not specify the style. For instance, “walking” could 
be parameterized by walking speed and quantized into slow and brisk or other 
categories. 



3.3 Semi-naturalistic, User-Driven Data Collection 

The most realistic training and test data would be naturalistic data acquired 
from subjects as they go about their normal, everyday activities. Unfortunately,
obtaining such data requires direct observation of subjects by researchers,
subject self-report of activities, or use of the experience sampling method [8] to
label subject activities for algorithm training and testing. Direct observation 
can be costly and scales poorly for the study of large subject populations. Sub- 
ject self-report recall surveys are prone to recall errors [8] and lack the temporal 
precision required for training activity recognition algorithms. Finally, the expe- 
rience sampling method requires frequent interruption of subject activity, which 
agitates subjects over an extended period of time. Some activities we would like 
to develop recognition algorithms for, such as folding laundry, riding escalators, 
and scrubbing windows, may not occur on a daily basis. A purely naturalis- 
tic protocol would not capture sufficient samples of these activities for thorough 
testing of recognition systems without prohibitively long data collection periods. 

In this work we compromise and use a semi-naturalistic collection protocol 
that should permit greater subject variability in behavior than laboratory data. 
Further, we show how training sets can be acquired from subjects themselves 
without the direct supervision of a researcher, which may prove important if 
training data must be collected by end users to improve recognition performance. 

For semi-naturalistic data collection, subjects ran an obstacle course consist- 
ing of a series of activities listed on a worksheet. These activities were disguised 
as goals in an obstacle course to minimize subject awareness of data collection. 
For instance, subjects were asked to “use the web to find out what the world’s 
largest city in terms of population is” instead of being asked to “work on a com- 
puter.” Subjects recorded the time they began each obstacle and the time they 
completed each obstacle. Subjects completed each obstacle on the course ensur- 
ing capture of all 20 activities being studied. There was no researcher supervision 
of subjects while they collected data under the semi-naturalistic collection pro- 
tocol. As subjects performed each of these obstacles in the order given on their 
worksheet, they labeled the start and stop times for that activity and made any 
relevant notes about that activity. Acceleration data collected between the start 
and stop times were labeled with the name of that activity. Subjects were free to 
rest between obstacles and proceed through the worksheet at their own pace as 
long as they performed obstacles in the order given. Furthermore, subjects had 
freedom in how they performed each obstacle. For example, one obstacle was 
to “read the newspaper in the common room. Read the entirety of at least one 
non-frontpage article.” The subject could choose which and exactly how many 
articles to read. Many activities were performed outside of the lab. Subjects were 
not told where or how to perform activities and could do so in a common room 
within the lab equipped with a television, vacuum, sofa, and reading materials 
or anywhere they preferred. No researchers or cameras monitored the subjects. 



3.4 Specific Activity Data Collection 

After completing the semi-naturalistic obstacle course, subjects underwent
another data collection session to collect data under somewhat more controlled
conditions. Linguistic definitions of activity are often ambiguous. The activity
“scrubbing,” for example, can be interpreted as window scrubbing, dish
scrubbing, or car scrubbing. For this data collection session, subjects were
therefore given short definitions of the 20 activity labels that resolved major
ambiguities in the activity labels while leaving room for interpretation so
that subjects could show natural, individual variations in how they performed
activities. For example, walking was described as “walking without carrying any
items in your hand or on your back heavier than a pound” and scrubbing was
described as “using a sponge, towel, or paper towel to wipe a window.” See [3]
for descriptions for all 20 activities.

Fig. 3. (a) Five minutes of 2-axis acceleration data annotated with subject
self-report activity labels. Data within 10 s of self-report labels is discarded
as indicated by masking. (b) Differences in feature values computed from FFTs
are used to discriminate between different activities.

Subjects were requested to perform random sequences of the 20 activities 
defined on a worksheet during laboratory data collection. Subjects performed 
the sequence of activities given at their own pace and labeled the start and end 
times of each activity. For example, the first 3 activities listed on the worksheet 
might be “bicycling,” “riding elevator,” and “standing still.” The researcher’s 
definition of each of these activities was provided. As subjects performed each of 
these activities in the order given on their worksheet, they labeled the start and 
stop times for that activity and made any relevant notes about that activity such 
as “I climbed the stairs instead of using the elevator since the elevator was out 
of service.” Acceleration data collected between the start and stop times were 
labeled with the name of that activity. To minimize mislabeling, data within 10 s 
of the start and stop times was discarded. Since the subject is probably standing 
still or sitting while he records the start and stop times, the data immediately 
around these times may not correspond to the activity label. Figure 3a shows 
acceleration data annotated with subject self-report labels. 
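
The labeling rule described above can be sketched as follows. This is a minimal
illustration rather than the authors' code: the function name, the
(start, stop, label) tuple layout, and the use of seconds since the start of the
session are assumptions made for the example.

```python
def label_samples(timestamps, annotations, margin_s=10.0):
    """Assign an activity label to each sample timestamp (in seconds).

    annotations is a list of (start_s, stop_s, label) tuples taken from the
    subject's worksheet (hypothetical layout). Samples within margin_s of a
    start or stop time, or outside every annotated interval, are labeled
    None so that they can be discarded before feature computation.
    """
    labels = []
    for t in timestamps:
        assigned = None
        for start_s, stop_s, label in annotations:
            # Keep only samples strictly more than margin_s away from both ends.
            if start_s + margin_s < t < stop_s - margin_s:
                assigned = label
                break
        labels.append(assigned)
    return labels
```

A sample taken exactly 10 s from a boundary is discarded here; whether the
original processing treated that boundary as inclusive is not stated.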

Although data collected under this second protocol is more structured than
the first, it was still acquired under less controlled conditions than in most prior
work. Subjects, who were not the researchers, could perform their activities
anywhere, including outside of the laboratory. Also, there was no researcher
supervision during the data collection session.

3.5 Feature Computation 

Features were computed on 512-sample windows of acceleration data with 256
samples overlapping between consecutive windows. At a sampling frequency of
76.25 Hz, each window represents 6.7 seconds. Mean, energy, frequency-domain
entropy, and correlation features were extracted from the sliding-window signals
for activity recognition. Feature extraction on sliding windows with 50% overlap
has demonstrated success in past work [9,23]. A window of several seconds
was used to sufficiently capture cycles in activities such as walking, window
scrubbing, or vacuuming. The 512-sample window size enabled fast computation
of the FFTs used for some of the features.
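
The windowing scheme can be sketched as below. The window size, overlap, and
sampling frequency come from the text; the function and constant names are
illustrative assumptions.

```python
SAMPLING_HZ = 76.25                   # sampling frequency stated in the text
WINDOW_SECONDS = 512 / SAMPLING_HZ    # duration of one 512-sample window


def sliding_windows(samples, size=512, overlap=256):
    """Yield consecutive windows of `size` samples that share `overlap` samples."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the window size")
    # A trailing partial window is dropped rather than padded (an assumption).
    for start in range(0, len(samples) - size + 1, step):
        yield samples[start:start + size]
```

With the default arguments each window starts 256 samples after the previous
one, which gives the 50% overlap described above.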

The DC feature is the mean acceleration value of the signal over the window. 
The energy feature was calculated as the sum of the squared discrete FFT com- 
ponent magnitudes of the signal. The sum was divided by the window length 
for normalization. Additionally, the DC component of the FFT was excluded 
in this sum since the DC characteristic of the signal is already measured by 
another feature. Note that the FFT algorithm used produced 512 components 
for each 512 sample window. Use of mean [10,1] and energy [21] of acceleration 
features has been shown to result in accurate recognition of certain postures and 
activities (see Figure 1). 
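
The two features above can be sketched as follows, as an illustration under
stated assumptions rather than the authors' implementation. A naive O(n^2)
discrete Fourier transform stands in for the FFT so that the example needs only
the standard library; the function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform; returns len(signal) complex components."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def mean_feature(window):
    """DC feature: mean acceleration value over the window."""
    return sum(window) / len(window)


def energy_feature(window):
    """Sum of squared DFT magnitudes, excluding the DC component (index 0),
    divided by the window length for normalization."""
    spectrum = dft(window)
    return sum(abs(c) ** 2 for c in spectrum[1:]) / len(window)
```

For a constant signal the energy feature is approximately zero, since all of the
signal is captured by the DC feature.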

Frequency-domain entropy is calculated as the normalized information en- 
tropy of the discrete FFT component magnitudes of the signal. Again, the DC 
component of the FFT was excluded in this calculation. This feature may sup- 
port discrimination of activities with similar energy values. For instance, biking 
and running may result in roughly the same amounts of energy in the hip acceler- 
ation data. However, because biking involves a nearly uniform circular movement 
of the legs, a discrete FFT of hip acceleration in the vertical direction may show 
a single dominant frequency component at 1 Hz and very low magnitude for all 
other frequencies. This would result in a low frequency-domain entropy. Running 
on the other hand may result in complex hip acceleration and many major FFT 
frequency components between 0.5 Hz and 2 Hz. This would result in a higher
frequency-domain entropy.
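
A possible reading of this feature is sketched below. The text does not spell
out the normalization, so the example assumes that the non-DC magnitudes are
scaled to sum to one, that Shannon entropy is computed over them, and that the
result is divided by its maximum possible value. As before, a naive DFT stands
in for the FFT and all names are illustrative.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform; returns len(signal) complex components."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def frequency_domain_entropy(window):
    """Normalized entropy of the DFT magnitudes, DC component excluded.

    Returns a value near 0 when one frequency dominates and 1.0 when all
    frequencies contribute equally (assumed normalization).
    """
    magnitudes = [abs(c) for c in dft(window)[1:]]
    total = sum(magnitudes)
    if total <= 1e-12:  # effectively constant signal: define entropy as 0
        return 0.0
    probabilities = [m / total for m in magnitudes]
    entropy = -sum(p * math.log(p) for p in probabilities if p > 0.0)
    return entropy / math.log(len(magnitudes))
```

Under these assumptions a pure sinusoid, like the idealized bicycling signal,
scores low, while a signal whose energy is spread over many frequencies scores
close to 1.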

Features that measure correlation or acceleration between axes can improve 
recognition of activities involving movements of multiple body parts [12,2]. Cor- 
relation is calculated between the two axes of each accelerometer hoarder board 
and between all pairwise combinations of axes on different hoarder boards. 
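
The pairwise scheme can be sketched as follows, assuming that "correlation"
means the Pearson correlation coefficient (this section does not give the
formula). The axis naming and the dictionary layout are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length windows."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # correlation is undefined for a constant window; use 0
    return cov / math.sqrt(var_x * var_y)


def correlation_features(axes):
    """Correlate every pair of axes, within a board and across boards.

    axes maps an axis name such as "hip_x" to its window of samples.
    """
    return {
        (a, b): pearson(axes[a], axes[b])
        for a, b in combinations(sorted(axes), 2)
    }
```

Taking all pairwise combinations of axes covers both cases named in the text:
the two axes of one board and axes on different boards.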

Figure 3b shows some of these features for two activities. It was anticipated 
that certain activities would be difficult to discriminate using these features. 
For example, “watching TV” and “sitting” should exhibit very similar if not 
identical body acceleration. Additionally, activities such as “stretching” could 
show marked variation from person to person and for the same person at different 
times. Stretching could involve light or moderate energy acceleration in the upper
body, torso, or lower body.

As discussed in the next section, several classifiers were tested for activity
recognition using the feature vector.

4 Evaluation 

Subjects were recruited using posters seeking research study participants for 
compensation. Posters were distributed around an academic campus and were 
also emailed to the student population. Twenty subjects from the academic com- 
munity volunteered. Data was collected from 13 males and 7 females. Subjects 
ranged in age from 17 to 48 (mean 21.8, sd 6.59). 

Each subject participated in two sessions of study. In the first session, sub- 
jects wore five accelerometers and a digital watch. Subjects collected the semi- 
naturalistic data by completing an obstacle course worksheet, noting the start 
and end times of each obstacle on the worksheet. Each subject collected between 
82 and 160 minutes of data (mean 104, sd 13.4). Six subjects skipped one or
two obstacles due to factors such as inclement weather, time constraints,
or problems with equipment in the common room (e.g. the television, vacuum, 
computer, and bicycle). Subjects performed each activity on their obstacle course 
for an average of 156 seconds (sd 50). 

In the second session, often performed on a different day, the same subjects 
wore the same set of sensors. Subjects performed the sequence of activities listed 
on an activity worksheet, noting the start and end times of these activities. Each 
subject collected between 54 and 131 minutes of data (mean 96, sd 16.7). Eight
subjects skipped between one and four activities due to the factors listed earlier.



4.1 Results 

Mean, energy, entropy, and correlation features were extracted from acceleration 
data. Activity recognition on these features was performed using decision table, 
instance-based learning (IBL or nearest neighbor), C4.5 decision tree, and naive 
Bayes classifiers found in the Weka Machine Learning Algorithms Toolkit [26]. 

Classifiers were trained and tested using two protocols. Under the first proto- 
col, classifiers were trained on each subject’s activity sequence data and tested on 
that subject’s obstacle course data. This user-specific training protocol was re- 
peated for all twenty subjects. Under the second protocol, classifiers were trained 
on activity sequence and obstacle course data for all subjects except one. The 
classifiers were then tested on obstacle course data for the only subject left out of 
the training data set. This leave-one-subject-out validation process was repeated 
for all twenty subjects. Mean and standard deviation for classification accuracy
under both protocols are summarized in Figure 4.
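
The leave-one-subject-out protocol can be sketched as below. The study itself
used classifiers from the Weka toolkit; this example only illustrates how the
splits are formed and summarized. The train_and_score callback is a
hypothetical stand-in for training and testing a classifier, and the use of the
sample standard deviation is an assumption.

```python
import statistics


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, training_subjects) pairs, one per subject."""
    for held_out in subject_ids:
        yield held_out, [s for s in subject_ids if s != held_out]


def cross_validate(subject_ids, train_and_score):
    """Run one trial per subject and return (mean, standard deviation).

    train_and_score(training_subjects, test_subject) must return the accuracy
    obtained on the held-out subject's data.
    """
    scores = [
        train_and_score(training, held_out)
        for held_out, training in leave_one_subject_out(subject_ids)
    ]
    return statistics.mean(scores), statistics.stdev(scores)
```

With 20 subjects this yields 20 trials, each training on data from the other 19
subjects.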

Overall, recognition accuracy is highest for decision tree classifiers, which
is consistent with past work where decision-based algorithms recognized lying,
sitting, standing and locomotion with 89.30% accuracy [1]. Nearest neighbor
is the second most accurate algorithm, and its strong relative performance is
also supported by prior work where nearest neighbor algorithms recognized
ambulation and postures with over 90% accuracy [16,10].

Classifier        User-specific Training    Leave-one-subject-out Training
Decision Table    36.32 ± 14.501            46.75 ± 9.296
IBL               69.21 ± 6.822             82.70 ± 6.416
C4.5              71.58 ± 7.438             84.26 ± 5.178
Naive Bayes       34.94 ± 5.818             52.35 ± 1.690

Fig. 4. Summary of classifier results (mean ± standard deviation) using user-specific
training and leave-one-subject-out training. Classifiers were trained on laboratory data
and tested on obstacle course data.

Activity              Accuracy    Activity                  Accuracy
Walking               89.71       Walking carrying items    82.10
Sitting & relaxing    94.78       Working on computer       97.49
Standing still        95.67       Eating or drinking        88.67
Watching TV           77.29       Reading                   91.79
Running               87.68       Bicycling                 96.29
Stretching            41.42       Strength-training         82.51
Scrubbing             81.09       Vacuuming                 96.41
Folding laundry       95.14       Lying down & relaxing     94.96
Brushing teeth        85.27       Climbing stairs           85.61
Riding elevator       43.58       Riding escalator          70.56

Fig. 5. Aggregate recognition rates (%) for activities studied using
leave-one-subject-out validation over 20 subjects.

Figure 5 shows the recognition results for the C4.5 classifier. Rule-based 
activity recognition appears to capture conjunctions in feature values that may 
lead to good recognition accuracy. For instance, the C4.5 decision tree classified 
sitting as an activity having nearly 1 G downward acceleration and low energy at 
both hip and arm. The tree classified bicycling as an activity involving moderate 
energy levels and low frequency-domain entropy at the hip and low energy levels
at the arm. The tree distinguishes “window scrubbing” from “brushing teeth” 
because the first activity involves more energy in hip acceleration even though 
both activities show high energy in arm acceleration. The fitting of probability 
distributions to acceleration features under a Naive Bayesian approach may be 
unable to adequately model such rules due to the assumptions of conditional 
independence between features and normal distribution of feature values, which 
may account for the weaker performance. Furthermore, Bayesian algorithms may 
require more data to accurately model feature value distributions. 

Figure 6 shows an aggregate confusion matrix for the C4.5 classifier based
on all 20 trials of leave-one-subject-out validation. Recognition accuracies for
stretching and riding an elevator were below 50%. Recognition accuracies for
“watching TV” and “riding escalator” were 77.29% and 70.56%, respectively.
These activities do not have simple characteristics and are easily confused with
other activities. For instance, “stretching” is often misclassified as “folding
laundry” because both may involve the subject moving the arms at a moderate rate.
Similarly, “riding elevator” is misclassified as “riding escalator” since both
involve the subject standing still. “Watching TV” is confused with “sitting and
relaxing” and “reading” because all the activities involve sitting. “Riding
escalator” is confused with “riding elevator” since the subject may experience
similar vertical acceleration in both cases. “Riding escalator” is also confused
with “climbing stairs” since the subject sometimes climbs the escalator stairs.

Fig. 6. Aggregate confusion matrix for C4.5 classifier based on leave-one-subject-out
validation for 20 subjects, tested on semi-naturalistic data.
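
Per-activity recognition rates and the dominant confusions discussed above can
be read off a confusion matrix along the following lines. The nested-dictionary
layout and the function names are assumptions made for this sketch; no values
from Figure 6 are reproduced here.

```python
def per_class_accuracy(confusion):
    """Percentage of each true activity's windows that were labeled correctly.

    confusion[true_label][predicted_label] holds a window count.
    """
    rates = {}
    for true_label, row in confusion.items():
        total = sum(row.values())
        correct = row.get(true_label, 0)
        rates[true_label] = 100.0 * correct / total if total else 0.0
    return rates


def most_confused_with(confusion, true_label):
    """Return the wrong label most often assigned to true_label, or None."""
    errors = {
        predicted: count
        for predicted, count in confusion[true_label].items()
        if predicted != true_label and count > 0
    }
    return max(errors, key=errors.get) if errors else None
```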

Recognition accuracy was significantly higher for all algorithms under the 
leave-one-subject-out validation process. This indicates that the effects of
individual variation in body acceleration may be dominated by strong commonalities
between people in activity pattern. Additionally, because leave-one-subject-out 
validation resulted in larger training sets consisting of data from 19 subjects, this 
protocol may have resulted in more generalized and robust activity classifiers. 
The markedly smaller training sets used for the user-specific training protocol 
may have limited the accuracy of classifiers. 

To control for the effects of sample size in comparing leave-one-subject-out
and user-specific training, preliminary results were gathered using a larger
training data set collected for three subjects. These subjects were affiliates of the
researchers (unlike the 20 primary subjects). Each of these subjects participated
in one semi-naturalistic and five laboratory data collection sessions. The C4.5
decision tree algorithm was trained for each individual using data collected from
all five of that individual’s laboratory sessions and tested on the semi-naturalistic
data. The algorithm was also trained on five laboratory data sets from five random
subjects other than the individual and tested on the individual’s semi-naturalistic
data. The results are compared in Figure 7. In this case, user-specific training
resulted in an increase in recognition accuracy of 4.32 percentage points over
recognition rates for leave-one-subject-out training. This difference shows that
given equal amounts of training data, training on user-specific data can result in
classifiers that recognize activities more accurately than classifiers trained on
example data from many people. However, the certainty of these conclusions is
limited by the low number of subjects used for this comparison and the fact that
the three individuals studied were affiliates of the researchers. Nonetheless, these
initial results support the need for further study of the power of user-specific
versus generalized training sets.

Classifier    User-specific Training    Leave-one-subject-out Training
C4.5          77.31 ± 4.328             72.99 ± 8.482

Fig. 7. Summary of classifier results (mean ± standard deviation) using user-specific
training and leave-one-subject-out training where both training data sets are equivalent
to five laboratory data sessions.

The above results suggest that real-world activity recognition systems can 
rely on classifiers that are pre-trained on large activity data sets to recognize 
some activities. Although preliminary results show that user-specific training 
can lead to more accurate activity recognition given large training sets, pre- 
trained systems offer greater convenience. Pre-trained systems could recognize 
many activities accurately without requiring training on data from their user, 
simplifying the deployment of these systems. Furthermore, since the activity 
recognition system needs to be trained only once before deployment, the slow 
running time for decision tree training is not an obstacle. Nonetheless, there may 
be limitations to a pre-trained algorithm. Although activities such as “running” 
or “walking” may be accurately recognized, activities that are more dependent 
upon individual variation and the environment (e.g. “stretching”) may require
person-specific training [13].

To evaluate the discriminatory power of each accelerometer location,
recognition accuracy using the decision tree classifier (the best performing
algorithm) was also computed using a leave-one-accelerometer-in protocol.
Specifically, recognition results were computed five times, each time using data
from only one of the five accelerometers for the training and testing of the algorithm.
The differences in recognition accuracy rates using this protocol from accuracy 
rates obtained from all five accelerometers are summarized in Figure 8. These 
results show that the accelerometer placed on the subject’s thigh is the most 
powerful for recognizing this set of 20 activities. Acceleration of the dominant 
wrist is more useful in discriminating these activities than acceleration of the 
non-dominant arm. Acceleration of the hip is the second best location for activ- 
ity discrimination. This suggests that an accelerometer attached to a subject’s 
cell phone, which is often placed at a fixed location such as on a belt clip, may 
enable recognition of certain activities. 

Confusion matrices resulting from leave-one-accelerometer-in testing [3] show
that data collected from lower body accelerometers placed on the thigh, hip,
and ankle is generally best at recognizing forms of ambulation and posture.
Accelerometer data collected from the wrist and arm is better at discriminating
activities involving characteristic upper body movements, such as reading from
watching TV or sitting, and strength-training (push-ups) from stretching. To
explore the power of combining upper and lower body accelerometer data, data
from thigh and wrist accelerometers and hip and wrist accelerometers were also
used and results are shown in Figure 8. Note that recognition rates improved
over 25% for the leave-two-accelerometers-in results as compared to the best
leave-one-accelerometer-in results. Of the two pairs tested, thigh and wrist
acceleration data resulted in the highest recognition accuracy. However, both the
thigh and wrist and the hip and wrist pairs showed less than a 5% decrease in
recognition rate from results using all five accelerometer signals. This suggests
that effective recognition of certain everyday activities can be achieved using two
accelerometers placed on the wrist and thigh or wrist and hip. Others have also
found that for complex activities at least one sensor on the lower and upper body
is desirable [14]. 1

Accelerometer(s) Left In    Difference in Recognition Accuracy
Hip                         -34.12 ± 7.115
Wrist                       -51.99 ± 12.194
Arm                         -63.65 ± 13.143
Ankle                       -37.08 ± 7.601
Thigh                       -29.47 ± 4.855
Thigh and Wrist             -3.27 ± 1.062
Hip and Wrist               -4.78 ± 1.331

Fig. 8. Difference in overall recognition accuracy (mean ± standard deviation) due to
leaving only one or two accelerometers in. Accuracy rates are aggregated for 20 subjects
using leave-one-subject-out validation.
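
The leave-one-accelerometer-in and leave-two-accelerometers-in comparisons
amount to restricting the feature vector to features that depend only on the
selected sensors before training and testing. A minimal sketch follows; the
feature naming, the (sensors, value) layout, and the decision to drop
correlation features that involve an excluded sensor are assumptions, since the
text does not describe how the restriction was implemented.

```python
def select_features(features, sensors_left_in):
    """Keep features computed solely from the sensors that are left in.

    features maps a feature name to (sensors, value), where sensors is a
    tuple naming every accelerometer the feature depends on.
    """
    allowed = set(sensors_left_in)
    return {
        name: value
        for name, (sensors, value) in features.items()
        if set(sensors) <= allowed
    }
```

For example, a thigh-wrist correlation feature would be kept when both the
thigh and wrist sensors are left in, but dropped when only the thigh is used.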



4.2 Analysis 

This work shows that user-specific training is not necessary to achieve
recognition rates of over 80% for some of the 20 everyday activities studied.
Classification accuracy rates of between 80% and 95% for walking, running,
climbing stairs, standing still, sitting, lying down, working on a computer,
bicycling, and vacuuming are comparable with recognition results using laboratory
data from previous works. However, most prior work has used data collected under
controlled laboratory conditions to achieve its recognition accuracy rates,
typically where data is hand annotated by a researcher. The 84.26% overall
recognition rate achieved in this work is significant because study subjects could
move about freely outside the lab without researcher supervision while collecting
and annotating their own semi-naturalistic data. This is a step towards creating
mobile computing systems that work outside of the laboratory setting.

1 Only the decision tree algorithm was used to evaluate the information content of
specific sensors, leaving open the possibility that other algorithms may perform
better with different sensor placements.

The C4.5 classifier used mean acceleration to recognize postures such as 
sitting, standing still, and lying down. Ambulatory activities and bicycling were 
recognized by the level of hip acceleration energy. Frequency-domain entropy and 
correlation between arm and hip acceleration strongly distinguished bicycling, 
which showed low entropy hip acceleration and low arm-hip correlation, from 
running, which displayed higher entropy in hip acceleration and higher arm-hip 
movement correlation. Both activities showed similar levels of hip acceleration 
mean and energy. Working on a computer, eating or drinking, reading,
strength-training as defined by a combination of sit-ups and push-ups, window scrubbing,
vacuuming, and brushing teeth were recognized by arm posture and movement 
as measured by mean acceleration and energy. 

Lower recognition accuracies for activities such as stretching, scrubbing, rid- 
ing an elevator, and riding an escalator suggest that higher level analysis is re- 
quired to improve classification of these activities. Temporal information in the 
form of duration and time and day of activities could be used to detect activities. 
For instance, standing still and riding an elevator are similar in terms of body 
posture. However, riding an elevator usually lasts for a minute or less whereas 
standing still can last for a much longer duration. By considering the duration of 
a particular posture or type of body acceleration, these activities could be dis- 
tinguished from each other with greater accuracy. Similarly, adults may be more 
likely to watch TV at night than at other times on a weekday. Thus, date and 
time or other multi-modal sensing could be used to improve discrimination of 
watching TV from simply sitting and relaxing. However, because daily activity 
patterns may vary dramatically across individuals, user-specific training may be 
required to effectively use date and time information for activity recognition. 
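
The text proposes duration as an additional cue but does not describe an
implementation. One way to expose duration to a later processing stage is to
collapse the per-window classifier output into runs of identical labels, as
sketched below. The function name and the duration approximation (number of
windows times the hop between window starts) are assumptions.

```python
from itertools import groupby

# Seconds between the starts of consecutive windows: 256-sample hop at 76.25 Hz.
WINDOW_STEP_S = 256 / 76.25


def label_runs(window_labels, step_s=WINDOW_STEP_S):
    """Collapse per-window labels into (label, approximate_duration_s) runs."""
    return [
        (label, sum(1 for _ in group) * step_s)
        for label, group in groupby(window_labels)
    ]
```

A later stage could then compare each run's duration against what is typical
for the activity, for instance when separating a short elevator ride from a
long period of standing still.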

The decision tree algorithm used in this work can recognize the content of 
activities, but may not readily recognize activity style. Although a decision tree 
algorithm could potentially recognize activity style using a greater number of 
labels such as “walking slowly,” “walking briskly,” “scrubbing softly,” or “scrub- 
bing vigorously,” the extensibility of this technique is limited. For example, the 
exact pace of walking cannot be recognized using any number of labels. Other 
techniques may be required to recognize parameterized activity style. 

Use of other sensor data modalities may further improve activity recognition. 
Heart rate data could be used to augment acceleration data to detect intensity 
of physical activities. GPS location data could be used to infer whether an in- 
dividual is at home or at work and affect the probability of activities such as 
working on the computer or lying down and relaxing. Use of person-specific
sensors such as GPS, however, is more likely to require that training data be
acquired directly from the individual rather than from a laboratory setting because
individuals can work, reside, and shop in totally different locations.

5 Conclusion

Using decision tree classifiers, recognition accuracy of over 80% on a variety of 20
everyday activities was achieved using leave-one-subject-out validation on data
acquired without researcher supervision from 20 subjects. These results are
competitive with prior activity recognition results that only used laboratory data.
Furthermore, this work shows acceleration can be used to recognize a variety of 
household activities for context-aware computing. This extends previous work 
on recognizing ambulation and posture using acceleration (see Figure 1). 

This work further suggests that a mobile computer and small wireless ac- 
celerometers placed on an individual’s thigh and dominant wrist may be able 
to detect some common everyday activities in naturalistic settings using fast 
FFT-based feature computation and a decision tree classifier algorithm. Deci- 
sion trees are slow to train but quick to run. Therefore, a pre-trained decision 
tree should be able to classify user activities in real-time on emerging mobile 
computing devices with fast processors and wireless accelerometers.

Abstract. The paper presents a technique to automatically track the
progress of maintenance or assembly tasks using body worn sensors. The 
technique is based on a novel way of combining data from accelerometers 
with simple frequency matching sound classification. This includes the 
intensity analysis of signals from microphones at different body locations 
to correlate environmental sounds with user activity. 

To evaluate our method we apply it to activities in a wood shop. On a 
simulated assembly task our system can successfully segment and identify 
most shop activities in a continuous data stream with zero false positives 
and 84.4% accuracy. 



1 Introduction 

Maintenance and assembly are among the most important applications of wear- 
able computing to date; the use of such technology in tasks such as aircraft 
assembly [17], vehicle maintenance [4] and other on-site tasks [2,7] demonstrates 
a genuine utility of wearable systems. 

The key characteristic of such applications is the need for the user to physi- 
cally and perceptually focus on a complex real world task. Thus in general the 
user cannot devote much attention to interaction with the system. Further, the
use of the system should not restrict the operator’s physical freedom of action.
As a consequence most conventional mobile computing paradigms are unsuit- 
able for this application field. Instead wearable systems emphasizing physically 
unobtrusive form factor, hands free input, head mounted display output and low 
cognitive load interaction need to be used. 

Our work aims to further reduce the cognitive load on the user while at 
the same time extending the range of services provided by the system. To this 
end we show how wearable systems can automatically follow the progress of a 
given maintenance or assembly task using a set of simple body worn sensors. 
With such context knowledge the wearable could pro-actively provide assistance 
without the need for any explicit action by the user. For example, a maintenance
support system could recognize which particular subtask is being performed
and automatically display the relevant manual pages on the system’s head-up 
display. The wearable could also record the sequence of operations that are being 
performed for later analysis, or could be used to warn the user if an important 
step has been missed. 



1.1 Related Work 

Many wearable systems explore context and proactiveness (e.g. [1]) as means of
reducing the cognitive load on the user. Much work has also been devoted to 
recognition methods, in particular the use of computer vision [20,24,25,16,15]. 

The application of proactive systems for assisting basic assembly tasks has
been explored in [22]; however, that work assumes sensors integrated into the
objects being assembled, not worn by the user doing the assembly.

Activity recognition based on body worn sensors, in particular acceleration 
sensors, has been studied by different research groups [11,14,23]. However, all of
the above work focused on recognizing comparatively simple activities (walk- 
ing, running, and sitting). Sound based situation analysis has been investigated 
by Pelton et al. and in the wearables domain by Clarkson and Pentland [12, 
5]. Intelligent hearing aids have also exploited sound analysis to improve their 
performance [3]. 



1.2 Paper Aims and Contributions 

This paper is part of our work aiming to develop a reliable context recognition 
methodology based on simple sensors integrated in the user’s outfit and in the 
user’s artifacts (e.g. tools, appliances, or parts of the machinery) [10]. It presents
a novel way of combining motion (acceleration) sensor based gesture recognition 
[8] with sound data from distributed microphones [18]. In particular we exploit 
intensity differences between a microphone on the wrist of the dominant hand 
and on the chest to identify relevant actions performed by the user’s hand. 

In the paper we focus on using the above method to track the progress of 
an assembly task. As described above such tasks can significantly benefit from 
activity recognition. At the same time they tend to be well structured and limited 
to a reasonable number of often repetitive actions. In addition, machines and 
tools typical to a workshop environment generate distinct sounds. Therefore 
these activities are well suited for a combination of gesture and sound-based 
recognition. 

This paper describes our approach and the results produced in an experiment 
performed on an assembly task in a wood workshop. We demonstrate that simple 
sensors placed on the user’s body can reliably select and recognize user actions 
during a workshop procedure.

Fig. 1. The wood workshop (left) with (1) grinder, (2) drill, (3) file and saw, (4) vise,
and (5) cabinet with drawers. The sensor type and placement (right): (1,4) microphone,
(2,3,5) 3-axis acceleration sensors and (6) computer



2 Experimental Setup 

Performing initial experiments on live assembly or maintenance tasks is
inadvisable because of cost and safety concerns and the need to obtain repeatable
measurements under experimental conditions. As a consequence we have decided
to focus on an “artificial” task performed at the workbench of the wood workshop
of our lab (see Figure 1). The task consisted of assembling a simple object made
of two pieces of wood and a piece of metal. The task required 8 processing steps 
using different tools; these were intermingled with actions typically exhibited 
in any real world assembly task, such as walking from one place to another or 
retrieving an item from a drawer. 



2.1 Procedure 

The assembly sequence consists of sawing a piece of wood, drilling a hole in 
it, grinding a piece of metal, attaching it to the piece of wood with a screw, 
hammering in a nail to connect the two pieces of wood, and then finishing the 
product by smoothing away rough edges with a file and a piece of sandpaper. 
The wood was fixed in the vise for sawing, filing, and smoothing (and removed 
whenever necessary). The test subject moved between areas in the workshop
between steps. Also, whenever a tool or an object (nail, screw, wood) was required,
it was retrieved from its drawer in the cabinet and returned after use.

Table 1. Steps of workshop assembly task

No    Action
 1    take the wood out of the drawer
 2    put the wood into the vise
 3    take out the saw
 4    saw
 5    put the saw into the drawer
 6    take the wood out of the vise
 7    drill
 8    get the nail and the hammer
 9    hammer
10    put away the hammer, get the driver and the screw
11    drive the screw in
12    put away the driver
13    pick up the metal
14    grind
15    put away the metal, pick up the wood
16    put the wood into the vise
17    take the file out of the drawer
18    file
19    put away the file, take the sandpaper
20    sand
21    take the wood out of the vise

The exact sequence of actions is listed in Table 1. The task was to recognize 
all tool-based activities. Tool-based activities exclude drawer manipulation, user 
locomotion, and clapping (a calibration gesture). The experiment was repeated 
10 times in the same sequence to collect data for training and testing. For prac- 
tical reasons, the individual processing steps were only executed long enough 
to obtain an adequate sample of the activity. This policy did not require the 
complete execution of any one task (e.g. the wood was not completely sawn), 
allowing us to complete the experiment in a reasonable amount of time. However,
this protocol influenced only the duration of each activity and not the manner
in which it was performed. 



2.2 Data Collection System 

The data was collected using the ETH PadNET sensor network [8] equipped
with 3-axis accelerometer nodes and two Sony mono microphones connected to
a body worn computer. The position of the sensors on the body is shown in
Figure 1: an accelerometer node on both wrists and on the right upper arm, and a
microphone on the chest and on the right wrist (the test subject was right handed).

[Figure 2 panel title: “Sanding”; horizontal axis: time [s]]

Fig. 2. Example accelerometer data from sawing and drilling (left); audio profile of
sanding from wrist and chest microphones (top right); and clustering of activities in
LDA space (bottom right)

As can be seen in Figure 1, each PadNET sensor node consists of two modules.
The main module incorporates a MSP430149 low power 16-bit mixed signal
microprocessor (MPU) from Texas Instruments running at 6 MHz maximum clock
speed. The current module version reads out up to three analog sensor signals,
including amplification and filtering, and handles the communication between
modules through dedicated I/O pins. The sensors themselves are hosted on an
even smaller “sensor-module” that can be either placed directly on the main
module or connected through wires. In the experiment described in this paper,
sensor modules were based on a 3-axis accelerometer package consisting of two
ADXL202E devices from Analog Devices. The analog signals from the sensor
were lowpass filtered (f_cutoff = 50 Hz) and digitized with 12-bit resolution using
a sampling rate of 100 Hz.

3 Recognition 

3.1 Acceleration Data Analysis 

Figure 2 (left) shows a segment of the acceleration data collected during the 
experiment. The segment includes sawing, removing the wood from the vise, and 
drilling. The user accesses the drawer two times and walks between the vise and 
the drill. Clear differences can be seen in the acceleration signals. For example, 
sawing clearly reflects a periodic motion. By contrast, the drawer access (marked 
as 1a and 1b in the figure) shows a low-frequency “bump” in acceleration. This
bump corresponds to the 90 degree turns of the wrist as the user releases the 
drawer handle, retrieves the object, and grasps the handle again to close the 
drawer. 

Given the data, time series recognition techniques such as hidden Markov
models (HMMs) [13] should allow the recognition of the relevant gestures.
However, a closer analysis reveals two potential problems. First, not all relevant
activities are strictly constrained to a particular sequence of motions. While the
characteristic motions associated with sawing or hammering are distinct, there
is high variation in drawer manipulation and grinding. Second, the activities
are separated by sequences of user motions unrelated to the task (e.g., the user
scratching his head). Such motions may be confused with the relevant activities.
We define a “noise” class to handle these unrelated gestures.



3.2 Sound Data Analysis 

Considering that most gestures relevant for the assembly/maintenance scenario
are associated with distinct sounds, sound analysis should help to address the
problems described above. We distinguish between three different types of sound:

1. Sounds made by a hand tool: Such sounds are directly correlated with
user hand motion. Examples are sawing, hammering, filing, and sanding.
These actions generally produce repetitive, quasi-stationary sounds (i.e., relatively
constant over time, such that each time slice of a sample would produce
an identical spectrum over a reasonable length of time). In addition, these
sounds are much louder than the background noise (dominant) and are likely
to be much louder at the microphone on the user’s hand than on his chest.
For example, the intensity curve for sanding (see Figure 2, top right) reflects
the periodic sanding motion, with the minima corresponding to the changes
in direction and the maxima coinciding with the maximum sanding speed in
the middle of the motion. Since the user’s hand is directly on the source of
the sound, the intensity difference is large. For other activities it is smaller
but in most cases still detectable.

2. Semi-autonomous sounds: These sounds are initiated by the user’s hand, which
possibly (but not necessarily) remains close to the source for most of the sound
duration. This class includes sound produced by a machine, such as the
drill or grinder. Although ideal quasi-stationary sounds, sounds in this class
may not necessarily be dominant and tend to have a less distinct intensity
difference between the hand and the chest (for example, when a user moves
their hand away from the machine during operation).

3. Autonomous sounds: These are sounds generated by activities not driven by
the user’s hands (e.g., loud background noises or the user speaking).

Obviously, the vast majority of relevant actions in assembly and maintenance
are associated with hand-tool sounds and semi-autonomous sounds. In principle,
these sounds should be easy to identify using intensity differences between the
wrist and the chest microphone. In addition, if extracted appropriately, these
sounds may be treated as quasi-stationary and can be reliably classified using
simple spectrum pattern matching techniques.

The main problem with this approach is that many irrelevant actions are
also likely to fall within the definition of hand-tool and semi-autonomous sound.
Such actions include scratching or putting down an object. Thus, like acceleration
analysis, sound-based classification also has problems distinguishing relevant
from irrelevant actions and will produce a number of false positives.



3.3 Recognition Methodology 

Neither acceleration nor sound provides enough information for perfect extraction
and classification of all relevant activities; however, we hypothesize that their
sources of error are likely to be statistically distinct. Thus, we develop a technique
based on the fusion of both methods. Our procedure consists of three steps:

1. Extraction of the relevant data segments using the intensity difference between
the wrist and the chest microphone. We expect that this technique
will segment the data stream into individual actions.

2. Independent classification of the actions based on sound or acceleration. 
This step will yield imperfect recognition results by both the sound and 
acceleration subsystems. 

3. Removal of false positives. While the sound and acceleration subsystems are 
each imperfect, when their classifications of a segment agree, the result may 
be more reliable (if the sources of error are statistically distinct). 

4 Isolated Activity Recognition 

As an initial experiment, we segment the activities in the data files by hand 
and test the accuracy of the sound and acceleration methods separately. For this 
experiment, the non-tool gestures, drawer and clapping, are treated as noise and 
as such are not considered here. 



4.1 Accelerometer-Based Activity Recognition

Hidden Markov models (HMMs) are probabilistic models used to represent non- 
deterministic processes in partially observable domains and are defined over a 
set of states, transitions, and observations. Details of HMMs and the respective 
algorithms are beyond the scope of this paper but may be found in Rabiner’s 
tutorial on the subject [13]. 

Hidden Markov models have been shown to be robust for representation 
and recognition of speech [9], handwriting [19], and gestures [21]. HMMs are 
capable of modeling important properties of gestures such as time variance (the 
same gesture can be repeated at varying speeds) and repetition (a gesture which 
contains a motion which can be repeated any number of times). They also handle 
noise due to sensors and imperfect training data by providing a probabilistic 
framework. 

For gesture recognition, a model is trained for each of the gestures to be recognized.
In our experiment, the set of gestures includes saw, drill, screw, hammer,
sand, file, and vise. Once the models are trained, a sequence of features can be
passed to a recognizer which calculates the probability of each model given the
observation sequence and returns the most likely gesture. For our experiments,
the set of features consists of readings from the accelerometers positioned at the
wrist and at the elbow. This provides 6 continuous feature values in total - the x, y,
and z acceleration readings for both positions - which are then normalized to
sum to one and collected at approximately 93 Hz.

We found that most of the workshop activities typically require only simple
single-Gaussian HMMs for modeling. For file, sand, saw, and screw, a 5-state
model with 1 skip transition and 1 loop-back transition suffices because they
consist of simple repetitive motions. Drill is better represented using a 7-state
model, while grinding is again more complex, requiring a 9-state model. The
vise is unique in that it has two separate motions, opening and closing. Thus
a 9-state model is used with two appropriate loop-backs to correctly represent
the gesture (see Figure 3). These models were selected through inspection of the
data, an understanding of the nature of the activities, and experience with HMMs.
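
As a rough illustration of what such a topology looks like, the following sketch builds the matrix of allowed transitions for a left-to-right model. It is not the configuration used in the experiments: the text does not say which states the skip and loop-back arcs connect, so the arcs chosen below (state 0 to 2, and state 4 back to 0), the function names, and the uniform initial probabilities are all assumptions made for the example.

```python
def left_to_right_topology(n_states, skips=(), loop_backs=()):
    """Return an n_states x n_states 0/1 matrix of allowed transitions.

    Every state may stay where it is or advance to its successor; extra
    forward (skip) and backward (loop-back) arcs are given as (src, dst).
    """
    allowed = [[0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        allowed[i][i] = 1                    # self-transition
        if i + 1 < n_states:
            allowed[i][i + 1] = 1            # step to the next state
    for src, dst in skips:
        if dst <= src + 1:
            raise ValueError("a skip must jump over at least one state")
        allowed[src][dst] = 1
    for src, dst in loop_backs:
        if dst >= src:
            raise ValueError("a loop-back must return to an earlier state")
        allowed[src][dst] = 1
    return allowed


def uniform_initial_probabilities(allowed):
    """Spread each row's probability mass evenly over its allowed arcs."""
    return [[a / sum(row) for a in row] for row in allowed]


# Hypothetical 5-state model for a repetitive gesture such as sawing:
# one skip (state 0 -> 2) and one loop-back (state 4 -> 0), both assumed.
saw_topology = left_to_right_topology(5, skips=[(0, 2)], loop_backs=[(4, 0)])
saw_transitions = uniform_initial_probabilities(saw_topology)
```

The loop-back arc is what lets a single model absorb an arbitrary number of repetitions of the same stroke; training would then re-estimate the non-zero entries from data.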



4.2 HMM Isolation Results 

For this project, a prototype of the Georgia Tech Gesture Recognition Toolkit
was used for HMM training and recognition. The Toolkit is an interface to
HTK [26], a toolkit designed for training HMMs for speech recognition. HTK
handles the algorithms for training and recognizing the hidden Markov models,
allowing us to focus primarily on properly modeling the data.

To test the performance of the HMMs in isolation, the shop accelerometer
data was partitioned by hand into individual examples of gestures. Accuracy of
the system was calculated by leave-one-out validation: each sample in turn was
reserved for testing while the models were trained on the remaining samples.
The HMMs were able to correctly classify 95.51% of the gestures over
data collected from the shop experiments. The rates for individual gestures are
given in Table 2.
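
The leave-one-out procedure can be sketched as follows. This is only an illustration under stated assumptions: the `train` and `classify` callables stand in for HMM training and recognition, and the toy per-class-mean classifier and the scalar data are invented for the example.

```python
def leave_one_out_accuracy(samples, labels, train, classify):
    """Hold out each sample once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if classify(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy stand-ins for the real models: one mean per class on a scalar feature.
def train_means(xs, ys):
    grouped = {}
    for x, y in zip(xs, ys):
        grouped.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in grouped.items()}


def classify_nearest(means, x):
    return min(means, key=lambda y: abs(means[y] - x))


# Invented data; the sample 3.2 is an outlier that is misclassified when held out.
samples = [1.0, 1.2, 0.9, 3.2, 5.0, 5.1, 4.8]
labels = ["saw", "saw", "saw", "saw", "drill", "drill", "drill"]
accuracy = leave_one_out_accuracy(samples, labels, train_means, classify_nearest)
```

With few examples per gesture, this scheme uses almost all of the data for training in every round, at the cost of retraining once per sample.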



4.3 Sound Recognition

Method

The basic sound classification scheme operates on individual frames of length $t_w$
seconds. The approach follows a three-step process: feature extraction,
dimensionality reduction, and the actual classification.

The features used are the spectral components of each $t_w$ frame obtained by Fast
Fourier Transformation (FFT). This produces $N = f_s \cdot t_w$ dimensional feature
vectors, where $f_s$ is the sample frequency. Rather than attempting to classify such
large $N$-dimensional vectors directly, Linear Discriminant Analysis (LDA) [6] is
employed to derive an optimal projection of the data into a smaller, $M - 1$
dimensional feature space (where $M$ is the number of classes). In the “recognition
phase”, the LDA transformation is applied to the data frame under test to
produce the corresponding $M - 1$ dimensional feature vector.

Using a labeled training set, class means are calculated in the $M - 1$ dimensional
space. Classification is performed simply by choosing the class mean which
has the minimum Euclidean distance from the test feature vector (see Figure 2,
bottom right).
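
The final classification step can be sketched in a few lines. The example assumes the LDA projection has already been applied, so it starts from vectors in the reduced space; the function names and the two-dimensional data (three classes) are made up for illustration.

```python
import math


def class_means(vectors, labels):
    """Average the projected training vectors of each class."""
    grouped = {}
    for vector, label in zip(vectors, labels):
        grouped.setdefault(label, []).append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }


def nearest_class(means, vector):
    """Return the label whose class mean is closest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(means[label], vector))


# Made-up 2-D vectors, i.e. M - 1 = 2 dimensions for M = 3 classes.
train_vectors = [(0.0, 0.1), (0.2, -0.1), (3.0, 3.1), (2.8, 2.9), (-3.0, 2.0), (-2.8, 2.2)]
train_labels = ["saw", "saw", "drill", "drill", "sand", "sand"]
means = class_means(train_vectors, train_labels)
predicted = nearest_class(means, (2.5, 2.7))
```

Because only one mean per class is stored, classifying a frame costs one distance computation per class, which suits a body-worn system.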

Intensity Analysis 

Making use of the fact that signal intensity is inversely proportional to the square
of the distance from its source, the ratio of the two intensities $I_{wrist}/I_{chest}$ is
used as a measure of the absolute distance of the source from the user. Assuming the
sound source is at distance $d$ from the wrist microphone and $d + \delta$ from the chest
microphone, the ratio of the intensities will be proportional to

\[
\frac{I_{wrist}}{I_{chest}} \propto \frac{(d + \delta)^2}{d^2} = \frac{d^2 + 2d\delta + \delta^2}{d^2} = 1 + \frac{2\delta}{d} + \frac{\delta^2}{d^2}
\]

When the two microphones are separated by at least $\delta$, any sound produced at
a distance $d$ (where $d \gg \delta$) from the user will bring this ratio close to one.
Sounds produced near the chest microphone (e.g., the user speaking) will cause
the ratio to approach zero, whereas any sounds close to the wrist microphone will
make this ratio large.
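
To give a feel for the magnitudes involved, the ratio can be evaluated for two hypothetical geometries. The distances below (in metres) are illustrative assumptions, not measurements from the experiment.

```python
def intensity_ratio(d, delta):
    """I_wrist / I_chest for a source at distance d from the wrist
    microphone and d + delta from the chest microphone (inverse-square law)."""
    if d <= 0:
        raise ValueError("d must be positive")
    return ((d + delta) / d) ** 2


# Assumed geometry: wrist and chest microphones about 0.5 m apart.
near_wrist = intensity_ratio(d=0.05, delta=0.5)   # tool held in the hand, about 121
far_away = intensity_ratio(d=10.0, delta=0.5)     # source across the room, about 1.10
```

A source at the hand therefore produces a ratio two orders of magnitude above that of a distant source, which is what makes a simple threshold workable.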

Sound extraction is performed by sliding a window $w_{ia}$ over the audio data
resampled at $f_s$ Hz. On each iteration, the signal energy over $w_{ia}$ is calculated
for each channel. For each window, the difference between the ratio
$I_{wrist}/I_{chest}$ and its reciprocal is obtained and compared to an empirically
obtained threshold $th_{ia}$.

The difference $I_{wrist}/I_{chest} - I_{chest}/I_{wrist}$ provides a convenient metric for
thresholding: zero indicates a far-off (or exactly equidistant) sound, while values
above or below zero indicate a sound closer to the wrist or chest microphone,
respectively.
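
A minimal sketch of this windowed thresholding is given below. Several details are assumptions rather than facts from the text: energy is taken as the sum of squared samples, the window advances by a fixed hop, a small constant guards against division by zero, and the window length and threshold in the example are arbitrary.

```python
def energy(frame):
    """Signal energy of a frame, taken here as the sum of squared samples."""
    return sum(x * x for x in frame)


def intensity_metric(wrist_frame, chest_frame, eps=1e-12):
    """I_wrist/I_chest - I_chest/I_wrist; eps avoids division by zero."""
    e_wrist = energy(wrist_frame) + eps
    e_chest = energy(chest_frame) + eps
    return e_wrist / e_chest - e_chest / e_wrist


def select_windows(wrist, chest, window, hop, threshold):
    """Return one boolean per window: True where the metric exceeds the threshold."""
    n = min(len(wrist), len(chest))
    selected = []
    for start in range(0, n - window + 1, hop):
        metric = intensity_metric(wrist[start:start + window],
                                  chest[start:start + window])
        selected.append(metric > threshold)
    return selected


# Invented signals: loud at the wrist for four samples, then quiet.
wrist = [1.0] * 4 + [0.1] * 4
chest = [0.5] * 8
mask = select_windows(wrist, chest, window=4, hop=4, threshold=1.0)
```

Here the first window yields a metric of 3.75 and is selected, while the second yields a strongly negative value and is rejected.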



4.4 Results 

In order to analyze the performance of the sound classification, individual
examples of each class were hand partitioned from each of the 10 experiments. This
provided at least 10 samples of every class - some classes had more samples on
account of more frequent usage (e.g., vise). From these, two samples of each
class were used for training while testing was performed on the rest.

Similar work [18] used FFT parameters of $f_s$ = 4.8 kHz and $t_w$ = 50 ms (256
points); for this experiment $t_w$ was increased to 100 ms. With these parameters,
LDA classification was applied to successive $t_w$ frames within each of the
class-partitioned samples, returning a hard classification for each frame. Judging
accuracy by the number of correctly matching frames over the total number
of frames in each sample, an overall recognition rate of 90.18% was obtained.
Individual class results are shown in the first column of Table 2. We then used
intensity analysis to select those frames where the source intensity
ratio difference surpassed a given threshold. With LDA classification applied only
to these selected frames, the recognition rate improved slightly to 92.21%
(second column of Table 2).

To make a comparison with the isolated accelerometer results, a majority
decision was taken over all individual frame results within each sample to
produce an overall classification for that gesture. This technique resulted in 100%
recognition over the sound test data in isolation.
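
The majority decision over frame results amounts to a simple vote, sketched below. The frame labels are invented, and the handling of ties (the label seen first wins) is an assumption, since the text does not discuss ties.

```python
from collections import Counter


def majority_label(frame_labels):
    """Return the most frequent label among the per-frame classifications."""
    if not frame_labels:
        raise ValueError("need at least one frame")
    return Counter(frame_labels).most_common(1)[0][0]


# Invented per-frame results for one filing sample: 4 of 7 frames are correct.
frames = ["file", "sand", "file", "file", "sand", "file", "saw"]
gesture = majority_label(frames)
```

This shows how a per-frame rate well below 100% (as for filing in Table 2) can still produce a correct decision for the sample as a whole.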



Table 2. Isolated recognition accuracy (in %) for sound LDA, LDA with IA preselection,
majority decision over IA+LDA, and for acceleration-based HMM

            Sound                                Acceleration
Gesture     LDA      IA+LDA    maj(IA+LDA)       HMM
Hammer      96.79    98.85     100               100
Saw         92.71    92.98     100               100
Filing      69.68    81.43     100               100
Drilling    99.59    99.35     100               100
Sanding     93.66    92.87     100               88.89
Grinding    97.77    97.75     100               88.89
Screwing    91.17    93.29     100               100
Vise        80.10    81.14     100               92.30
Overall     90.18    92.21     100               95.51




5 Continuous Recognition 

Recognition of gestures from a continuous stream of features is difficult.
However, we can simplify the problem by partitioning the continuous stream into
segments and attacking the problem as isolated recognition. This approach
requires a method of determining a proper partitioning of the continuous stream.
We take advantage of the intensity analysis described in the previous section as
a technique for identifying appropriate segments for recognition.

Since neither LDA nor the HMM is perfect at recognition, and each is able to
recognize a different set of gestures well because they work in different feature
spaces, it is advantageous to compare their independent classifications of a segment. If
the classification of the segment by the HMMs matches the classification of the
segment by the LDA, the classification can be believed. Otherwise, the noise class
can be assumed, or perhaps a decision appropriate to the task can be taken (such
as requesting additional information from the user).

Thus, the recognition is performed in three main stages: 1) Extracting poten- 
tially interesting partitions from the continuous sequence, 2) Classifying these 
individually using the LDA and HMMs, and 3) Combining the results from these 
approaches. 



5.1 LDA for Partitioning 

For classification, partitioned data needs to be arranged in continuous sections
corresponding to a single user activity. Such partitioning of the data is obtained
in two steps. First, LDA classification is run on segments of data chosen by the
intensity analysis (IA). Those segments not chosen by IA are returned with
classification zero. (In this experiment, classifications are returned at the same rate as
accelerometer features.) Second, these small-window classifications are further
processed by a larger (several seconds) majority decision window, which returns
a single result for the entire window duration.
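
The two steps can be sketched as follows. The sketch makes assumptions the text leaves open: the majority windows are taken as non-overlapping blocks, adjacent blocks with the same non-zero label are merged into one partition, and the integer class labels and window length are arbitrary.

```python
from collections import Counter

NOT_SELECTED = 0  # label returned for frames rejected by intensity analysis


def partition(frame_labels, window):
    """Majority-vote each block of `window` frames, then merge adjacent
    blocks that share a non-zero label into (label, start, end) partitions."""
    partitions = []
    for start in range(0, len(frame_labels), window):
        block = frame_labels[start:start + window]
        label = Counter(block).most_common(1)[0][0]
        end = start + len(block)
        if label == NOT_SELECTED:
            continue
        if partitions and partitions[-1][0] == label and partitions[-1][2] == start:
            partitions[-1] = (label, partitions[-1][1], end)  # extend previous
        else:
            partitions.append((label, start, end))
    return partitions


# Invented frame-level output: class 2, a gap, then class 5, with some dropouts.
frames = [0, 0, 0, 0,  2, 2, 0, 2,  2, 2, 2, 2,  0, 0, 2, 0,  5, 5, 5, 0,  5, 0, 5, 5]
result = partition(frames, window=4)
```

In this example the isolated class-2 frame inside the gap and the dropouts inside each activity are voted away, leaving two partitions, frames 4 to 12 and 16 to 24.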

This partitioning mechanism helps reduce the complexity of continuous 
recognition. It will not give accurate bounds on the beginning and end of a 
gesture. Instead, the goal is to provide enough information to generate context 
at a general level, i.e., “The user is hammering” as opposed to “A hammering 
gesture occurred between sample 1500 and 2300.” The system is tolerant of, 
and does not require, perfect alignment between the partitions and the actual 
gesture. The example alignment shown in Figure 4 is acceptable for our purposes. 



5.2 Partitioning Results 

Analysis of the data was performed to test the system’s ability to reconstruct 
the sequence of gestures in the shop experiments based on the partitioning and 
recognition techniques described to this point. Figure 5 shows an example of the 
automated partitioning versus the actual events. The LDA classification of each 
partition is also shown. For this analysis of the system, the non-tool gestures, 
drawer and clapping, were considered as part of the noise class. After apply- 
ing the partition scheme, a typical shop experiment resulted in 25-30 different 
partitions. 



5.3 HMM Classification 

Once the partitions are created by the LDA method, they are passed to a set of
HMMs for further classification. For this experiment, the HMMs are trained on
individual gestures from the shop experiments using 6 accelerometer features
from the wrist and elbow. Ideally, the HMMs will return a single gesture
classification for each segment. However, the segment sometimes includes the end
of the previous gesture or the beginning of the next, causing the HMMs to return
a sequence of gestures. In such cases, the gesture which makes up the majority
of the segment is used as the classification. For example, the segment labeled “B”
in Figure 4 may return the sequence “hammer vise” and would then be assigned
the single gesture “vise.”



5.4 Combining LDA and HMM Classification 

For each partitioned segment, the classifications of the LDA and HMM methods
were compared. If the classifications matched, that classification was assigned
to the segment. Otherwise, the noise class was returned.
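
The agreement rule is simple enough to state directly in code. The function names and the example labels below are illustrative only.

```python
NOISE = "noise"


def fuse(lda_label, hmm_label):
    """Accept a classification only when both classifiers agree."""
    return lda_label if lda_label == hmm_label else NOISE


def fuse_segments(lda_labels, hmm_labels):
    """Apply the agreement rule to parallel lists of per-segment labels."""
    if len(lda_labels) != len(hmm_labels):
        raise ValueError("both classifiers must label the same segments")
    return [fuse(lda, hmm) for lda, hmm in zip(lda_labels, hmm_labels)]


# Invented outputs for four segments; the classifiers disagree on two of them.
lda = ["hammer", "screw", "vise", "screw"]
hmm = ["hammer", "grinding", "vise", "drilling"]
fused = fuse_segments(lda, hmm)
```

The rule trades recall for precision: a disagreement is always reported as noise, so a true gesture recognized by only one classifier becomes a deletion rather than a possible false positive.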



Fig. 4. Detailed example of LDA partitioning (rows: Actual, LDA Partition; events:
Drawer, Hammer, Vise)



Fig. 5. LDA partitions versus ground truth on a typical continuous dataset
(Workshop dataset #7)

Table 3. Continuous recognition accuracy per gesture (C = Correct, I = Insertions,
D = Deletions, S = Substitutions, Acc = Accuracy in %) and probability of gesture
given classification, P(G|Class)

                      HMM                         LDA                      HMM + LDA
Gesture      C    I    D    S     Acc     C    I    D    S     Acc     C    I    D    S     Acc   P(G|Class)
Hammer       8    2    0    1    66.7     9    1    0    0    88.9     8    0    1    0    88.9   1.00
Saw          9    0    0    0   100       9    1    0    0    88.9     9    0    0    0   100     1.00
Filing      10    0    0    0   100       9    7    0    1    23.2     9    0    1    0    90     1.00
Drilling     9    7    0    0    22.2     9    1    0    0    88.9     9    0    0    0   100     1.00
Sanding      8    0    0    1    77.8     9    8    0    0    11.1     8    0    1    0    88.9   1.00
Grinding    11   13    0    0   -18.2     9    0    0    2    81.8     9    0    2    0    81.8   1.00
Screw        5    1    0    4    44.4     9   75    0    0  -733.3     4    0    5    0    44.4   1.00
Vise        42    0    0    1    97.7    34    1    2    7    76.6    36    0    7    0    83.7   1.00
Overall    102   23    0    7    72.5    97   94    1   10     2.8    92    0   17    0    84.4   1.00

Table 3 shows the number of correct classifications (C), insertions (I),
deletions (D), and substitutions (S) for the HMMs, the LDA, and the combination.
Insertions are defined as noise gestures identified as a tool gesture. Deletions are
tool gestures recognized as noise gestures. A substitution for a gesture occurs
when that gesture is incorrectly identified as a different gesture. In addition, the
accuracy of the system is calculated based on the following metric:

\[
\mathrm{Accuracy} = \frac{\mathrm{Correct} - \mathrm{Insertions}}{\mathrm{TotalSamples}}
\]

The final column reports the probability of a gesture having occurred given
that the system reported that gesture.
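
As a worked check of the metric against a few entries of Table 3, the short sketch below recomputes three accuracies. It assumes that the total number of samples equals correct + deletions + substitutions, which is consistent with the rows used here but is not stated in the text.

```python
def accuracy_percent(correct, insertions, deletions, substitutions):
    """(Correct - Insertions) / TotalSamples, expressed in percent.

    Assumption: TotalSamples = correct + deletions + substitutions, i.e.
    every true tool gesture is either recognized, missed, or confused.
    """
    total = correct + deletions + substitutions
    return 100.0 * (correct - insertions) / total


# Counts taken from Table 3.
hammer_hmm = accuracy_percent(correct=8, insertions=2, deletions=0, substitutions=1)
screw_lda = accuracy_percent(correct=9, insertions=75, deletions=0, substitutions=0)
overall_fused = accuracy_percent(correct=92, insertions=0, deletions=17, substitutions=0)
```

Rounded to one decimal place these give 66.7, -733.3, and 84.4. Because insertions are subtracted in the numerator, the metric becomes negative whenever a classifier produces more false detections than correct ones, as LDA does for the screw gesture.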

Clearly, the HMMs and LDA each perform better than the other on certain
gestures, and each tended to err in favor of a particular gesture. When incorrect, LDA
tended to report the “screw” gesture. Similarly, the HMMs tended to report
“grinding” or “drilling.” Comparing the two classifications significantly helps address
this problem and reduces the number of false positives, thus increasing the
performance of the system as a whole. The data shows that the comparison method
performed better than the HMMs and the LDA in many cases and improved the
accuracy of the system.

6 Discussion 

Although the accuracy of the system in general is not perfect, it is important 
to note that the combined HMM + LDA method results in no insertions or 
substitutions. This result implies that when the system returns a gesture, that 
gesture did occur. While the system still misses some gestures, the fact that 
it does not return false positives allows a user interface designer to be more 
confident in his use of positive context. 

Of course, for many applications deletions are just as undesirable as false
positives. In a safety monitoring scenario, for example, any deletions of alarm
or warning events would naturally be unacceptable. In such cases it would be
better for the system to return some warning, however erroneous, rather than
none at all. On the other hand, if one sensor is known to produce many false
positives in particular circumstances, whereas another is known to be extremely
reliable for the same, then some means of damping the influence of the first in
favour of the second sensor would be desirable.

The simple fusion scheme described in this paper could be modified to
accommodate these issues by weighting sensor inputs based on knowledge of their
reliability in given circumstances. Such weighting, together with decision
likelihood information from individual classifiers, would allow a more intelligent
fusion scheme to be developed. This will be the focus of future work.

7 Conclusion 

We have shown a system capable of segmenting and recognizing typical user 
gestures in a workshop environment. The system uses wrist and chest worn mi- 
crophones and accelerometers, leveraging the feature attributes of each modality 
to improve the system’s performance. For the limited set analyzed, the system 
demonstrated perfect performance in isolated gesture testing and a zero false 
positive rate in the continuous case. In the future, we hope to apply these promis- 
ing techniques, together with more advanced methods for sensor fusion, to the 
problem of recognizing everyday gestures in more general scenarios.

Abstract. As the proliferation of pervasive and ubiquitous computing devices
continues, users will carry more devices. Without the ability for these devices to 
unobtrusively interact with one another, the user’s attention will be spent on co- 
ordinating, rather than using, these devices. We present a method to determine 
if two devices are carried by the same person, by analyzing walking data re- 
corded by low-cost MEMS accelerometers using the coherence function, a 
measure of linear correlation in the frequency domain. We also show that these 
low-cost sensors perform similarly to more expensive accelerometers for the 
frequency range of human motion, 0 to 10Hz. We also present results from a 
large test group illustrating the algorithm’s robustness and its ability to with- 
stand real world time delays, crucial for wireless technologies like Bluetooth 
and 802.11. We present results that show that our technique is 100% accurate 
using a sliding window of 8 seconds of data when the devices are carried in the 
same location on the body, is tolerant to inter-device communication latencies, 
and requires little communication bandwidth. In addition we present results for 
when devices are carried on different parts of the body. 



1 Introduction 

For the past 30 years, the dominant model for using our computing devices has been 
interactive. This approach puts the human in a feedback loop together with the
computer. A user generates input and the computer responds through an output device;
this output is then observed by the user, who reacts with new input. When the ratio of
humans to computing devices was close to 1:1, this was a reasonable approach. Our 
attention was commanded by one device at a time, our desktop, laptop, or handheld. 
This was appropriate as our tasks often involved manipulating information on the 
computer’s screen in word processing, drawing, etc. 

Today, the conditions of human-computer interaction are rapidly changing. We 
have an ever-increasing number of devices. Moreover, they are becoming deeply 
embedded into objects, such as automobiles. Many of these devices have a powerful 
CPU inside them; however, we do not think of them as computing devices.

There are two main implications of this explosion in the number of computing
devices. First, the human user can no longer be in the loop of every interaction with and
between these devices; there are just too many, and the interactive model simply does not
scale. Devices must share information appropriately or they will end up demanding
even more of our time. To further complicate matters, each device will necessarily
have a different specialized user interface due to its different function and form
factor. Second, as these new devices are embedded in other objects, we often are not
even aware there are computing devices present, because our focus is on our task, not
on the devices.

Invisibility is an increasingly important aspect of user interaction, with the princi- 
pal tenet being “do not distract the user from the task at hand”. An example of this is 
package delivery, which now includes tablet-like computers to collect signatures, 
RFID tracking of packages, and centralized databases to provide web services to cus- 
tomers, such as the current location of their parcel. Delivery truck drivers, cargo han- 
dlers, and recipients do not want a user interface to slow down a package in reaching 
its destination. They prefer if the devices gather input, explicitly or implicitly, and 
communicate the data amongst themselves. There is no reason for users to take an 
interactive role with all the steps, nor do they want to. Another example draws on 
devices becoming so cheap that they are viewed as a community resource. In hospi- 
tals, nurses and physicians carry clipboards, charts, and folders that could provide 
more timely information if they were electronic devices connected to the hospital’s 
infrastructure. Many individuals would use these devices as they use their paper ver- 
sions now. One way to enhance the user interaction with these devices would be for 
the devices themselves to recognize whom they were being carried by. 

Motivated by these examples, we are investigating methods for devices to deter- 
mine automatically when they should interact or communicate with each other. Our 
goal is to enable devices to answer questions such as: 

• Is the same person carrying two devices? With what certainty? 

• Are two devices in the same room? For how long? 

• Are two devices near each other? How near? 

• What devices did I have with me when I came in? When I went out? 

Different applications will want answers to a different set of these and other ques- 
tions. We are developing a toolkit of technologies and methods that can be used by 
interaction designers to create systems with a high degree of invisible interactions 
between many devices. This paper presents our work on developing methods for an- 
swering the first question. 

We assume a world where a user will carry a changing collection of devices 
throughout a day. These might include a cell phone, a laptop, a tablet, and a handheld. 
In addition, the collection may include more specialized devices such as RFID or 
barcode scanners, GPS receivers, wrist-watch user interfaces, eyeglass-mounted dis- 
plays, headphones, etc. These devices may be tossed into a pocket, strapped to cloth- 
ing, worn on a part of the body, or placed in a backpack or handbag. We posit that it 
will be an insignificant addition to the cost of these devices if they include a 3-axis
accelerometer. We also expect these devices to have a means of communicating with
each other through wireless links, such as Bluetooth or 802.11. Recent work in
wireless sensor networks is demonstrating that the communicating nodes may become as
small as “smart dust” [1] and function with a high degree of power efficiency.

Here we consider the problem of how easily and reliably two such devices can
autonomously determine whether or not they “belong” to the same user by comparing 
their acceleration over time. If the acceleration profiles of the two devices are similar 
enough, they should be able to conclude that the same person is currently carrying 
them. In practical situations, one of these devices may be a personal one (e.g., a wrist- 
watch or pager on a belt that is “always” with the same person) while the other is a 
device picked-up and used for a period of time. The question we posed is: “How re- 
liably can accelerometer data be used to make the determination that two devices are 
on the same person?” 



2 Related Work 

Many techniques exist that could possibly be used to answer our question. We could 
use capacitive-coupling techniques to determine if two devices are touching the same 
person [2]; however, this requires direct physical contact with both devices and is
highly dependent on body geometry and device placement. A second approach is to 
use radio signal strength [3] to determine proximity of two devices. However, RF 
signal strength is not a reliable measure of distance and is also highly dependent on 
body orientation and placement of devices. Furthermore, these RF signals could be 
received by nearby unauthorized devices on another person or in the environment. 

The approach we investigate in this paper is to directly measure the acceleration 
forces on two devices and then compare them over a sliding time window. There has 
been much previous work in using accelerometers for gesture recognition and device 
association. We describe the contributions of three different pieces of related work: 
two from the ubiquitous computing research community and one from bioengineering 
instrumentation. 

Gesture recognition using accelerometers has been used to develop a wand to re- 
mote control devices in a smart space [4] and a glove that uses sensing on all the fin- 
gers to create an “air keyboard” for text input [5]. This work is primarily concerned 
with using accelerometer data as part of the process in computing the position of an 
object, in these cases, a plastic tube or finger segments, respectively. By observing 
the variations in position over time, gestures can be recognized. 

Device association is the process by which two devices decide whether they should 
communicate with each other in some way. Work at TeCO used accelerometers to 
create smart objects (Smart-Its) that could detect when they were being shaken to- 
gether [6]. The idea was to associate two devices by placing them together and shak- 
ing the ensemble. Similar accelerations on the two devices would allow the connec- 
tion to be established. The assumption is that it would be unlikely that two devices 
would experience the same accelerations unintentionally. Hinckley has developed a 
similar technique that uses bumping rather than shaking [7]. In both of these cases, 
the analysis of the accelerometer data is in the time domain, which can be sensitive to 
latencies in communication between the devices. Both techniques are also similar in 
that the decision is strictly binary, rather than a computed probability that the 
two devices are being intentionally associated. The principal difference between this 
work and ours is more fundamental: while these two contributions exploit explicit 
user-initiated interactions (shaking or bumping), our focus is on making the determination 
implicitly and independently of the user’s attention. 

The work that is closest to ours, and provided much of our inspiration, used accelerometers 
to determine if the trembling experienced by a patient with Parkinson’s 
Disease was caused by a single area of the brain or possibly multiple areas of the 
brain [8]. Physicians developed accelerometer sensors that were strapped to patients’ 
limbs, data was collected, and an off-line analysis determined if the shaking was cor- 
related in the frequency domain. Shaking in the same limb was found to be highly 
correlated; however, shaking across limbs was found to be uncorrelated. The key 
observation in this work is that Parkinson’s-related shaking is likely due to multiple 
sources in the brain that may be coupled to each other. We use a very similar approach, 
but with an on-line algorithm that can run continuously within devices that are 
carried rather than strapped to the body. 
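
As a rough illustration of what "correlated in the frequency domain" can mean, the sketch below computes magnitude-squared coherence between two sampled acceleration signals, averaging spectra over consecutive segments of a window. The choice of coherence as the measure, the function names (`dft`, `coherence`, `synthetic_signal`), the 32-sample segment length, and the synthetic 2 Hz test signals are all assumptions made for illustration; they are not taken from [8] or from the text above.

```python
import cmath
import math
import random


def dft(x):
    """Naive one-sided DFT; adequate for the short segments used here."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def coherence(x, y, seg_len):
    """Magnitude-squared coherence per frequency bin.

    Spectra are averaged over non-overlapping segments; with a single
    segment the estimate is trivially 1, so several segments are needed.
    """
    n_seg = min(len(x), len(y)) // seg_len
    n_bins = seg_len // 2 + 1
    pxx = [0.0] * n_bins
    pyy = [0.0] * n_bins
    pxy = [0j] * n_bins
    for s in range(n_seg):
        xs = x[s * seg_len:(s + 1) * seg_len]
        ys = y[s * seg_len:(s + 1) * seg_len]
        # Remove each segment's mean so a constant offset (e.g. gravity)
        # does not dominate the spectrum.
        mx, my = sum(xs) / seg_len, sum(ys) / seg_len
        fx = dft([v - mx for v in xs])
        fy = dft([v - my for v in ys])
        for k in range(n_bins):
            pxx[k] += abs(fx[k]) ** 2
            pyy[k] += abs(fy[k]) ** 2
            pxy[k] += fx[k] * fy[k].conjugate()
    # Bins with essentially no power (such as DC after mean removal)
    # are reported as 0.0 instead of dividing by a near-zero value.
    eps = 1e-12
    return [
        abs(pxy[k]) ** 2 / (pxx[k] * pyy[k])
        if pxx[k] > eps and pyy[k] > eps else 0.0
        for k in range(n_bins)
    ]


def synthetic_signal(n, fs, freq, delay, noise, rng):
    """Hypothetical test signal: a delayed sinusoid plus Gaussian noise."""
    return [
        math.sin(2 * math.pi * freq * (t - delay) / fs) + rng.gauss(0.0, noise)
        for t in range(n)
    ]


rng = random.Random(0)
fs, seg_len, n = 32, 32, 32 * 16                 # 16 one-second segments at 32 Hz
a = synthetic_signal(n, fs, 2.0, 0, 0.3, rng)    # device A
b = synthetic_signal(n, fs, 2.0, 3, 0.3, rng)    # device B: same motion, 3-sample lag
c = [rng.gauss(0.0, 1.0) for _ in range(n)]      # unrelated device

together = coherence(a, b, seg_len)
apart = coherence(a, c, seg_len)
print(round(together[2], 3), round(sum(apart) / len(apart), 3))
```

With one-second segments, bin 2 corresponds to 2 Hz. The shared motion should give a coherence near 1 in that bin even though device B lags device A, while the unrelated signal should stay low on average. An on-line version would recompute this over the most recent window as new samples arrive.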

Researchers working on real-time activity recognition have tried to identify activities 
such as standing, sitting, walking, lying down, and climbing stairs, using 
more structured placement of multiple sensors and analysis methodologies such as neural 
networks and Markov models. See [9] for a detailed overview of this work. 



3 Our Approach 

In order to provide a useful detection tool that doesn’t require any user interaction, the 
input to our system must come from an existing, natural action. In this paper we focus 
on the activity of human walking. 

Although there are a number of different actions that a person regularly performs, 
walking provides a useful input because of its periodic nature. Human locomotion is 
regulated by the mechanical characteristics of the human body [10] as much as by 
conscious control over our limbs. This regular, repeated activity lends itself to an analysis 
in the frequency domain, which helps reduce the effect of problems such as communication 
latencies and device-dependent thresholds, and avoids the need for complex and 
computationally expensive analysis models. 
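
The point about communication latencies can be illustrated with a small sketch. A delay between two otherwise identical periodic signals can drive their zero-lag time-domain correlation to zero, yet it leaves their magnitude spectra unchanged. The signal parameters (a 2 Hz sinusoid sampled at 32 Hz, a quarter-period delay) and the helper names (`pearson`, `magnitude_spectrum`) are assumptions chosen for illustration, not values from our experiments.

```python
import cmath
import math


def pearson(x, y):
    """Zero-lag Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def magnitude_spectrum(x):
    """Magnitudes of the one-sided DFT (naive O(n^2) implementation)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


fs, freq, n = 32, 2.0, 64   # assumed sample rate (Hz), signal frequency (Hz), window length
delay = 4                   # samples; a quarter of the 16-sample period

x = [math.sin(2 * math.pi * freq * t / fs) for t in range(n)]
y = [math.sin(2 * math.pi * freq * (t - delay) / fs) for t in range(n)]

r = pearson(x, y)           # time domain: the delay hides the similarity
sx, sy = magnitude_spectrum(x), magnitude_spectrum(y)
peak_x = max(range(len(sx)), key=sx.__getitem__)
peak_y = max(range(len(sy)), key=sy.__getitem__)
spectrum_gap = max(abs(p - q) for p, q in zip(sx, sy))

print(f"zero-lag correlation: {r:.3f}")
print(f"peak bins: {peak_x}, {peak_y}; largest magnitude difference: {spectrum_gap:.2e}")
```

A time-domain comparison would have to search over candidate lags to recover the match, whereas the magnitude spectra agree without any alignment step.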

We have two aims in this paper. First, we want to assess the quality of acceleration 
measurements obtained from low-cost accelerometers, to ensure that they are appropriate 
for this application and that their measurements have a physical basis. Second, 
we want to determine whether there is sufficient information in the accelerations of 
two devices to determine whether they are being carried together. 



4 Methods 

Three different 3-axis MEMS accelerometers were used for our experiments: two 
were low-cost commercial accelerometers from Analog Devices [5, 6] and STMicroelectronics, 
and the third was a calibrated accelerometer from Crossbow Technologies. 
Table 1 lists the accelerometers along with some specifications for each. 




